Training mode implementation (WIP) #2663
const-t
left a comment
I see that the PR is WIP, but I have a few comments for the future.
| */ | ||
| if (likely(!tfw_mode_is_disabled())) { | ||
| s = rcu_dereference(g_stats); | ||
| percpu_counter_add(&s->sum, delta1); |
What is the reason to use percpu_counter instead of a simple per-CPU variable? percpu_counter is pretty large and has overhead, so there must be a reason to use it.
| @@ -0,0 +1,181 @@ | |||
| /** | |||
I suggest renaming this to adaptive_limits.c or similar and using the word "training" only in the sense of "training mode" as a state of the adaptive limits.
| atomic_long_t max; | ||
| s64 __percpu *counter; | ||
| u16 epoch; | ||
| } TfwClientCounter; |
From my point of view, we should move this to training.h, along with all other related structs.
| } | ||
|
|
||
| static bool | ||
| tfw_client_counter_training_check(TfwClientCounter *counter, |
It seems client.c is not the right place for this function. I would prefer to have it in training.c.
| return defence(curr); | ||
|
|
||
| if (tfw_client_counter_change_max(counter, curr, &delta1, &delta2)) | ||
| adjust_num(delta1, delta2); |
I would suggest moving the update of the global stats to tfw_http_conn_recv_finish(); we don't need a live update of the counter during training.
Introduce helper functions for 128-bit arithmetic that are not provided
by the Linux kernel:
- 128/32 division using bitwise long division;
- integer square root using binary search.
The library is required for training mode statistics collection, where
aggregating metrics across a large number of clients can overflow 64-bit
intermediate values.
An evaluation comparing the sum/sumsq and Welford algorithms using both
64-bit and 128-bit arithmetic showed that 64-bit implementations become
inaccurate for workloads with approximately 100,000 or more clients due
to intermediate overflows, while both 128-bit implementations match the
exact results across all tested workloads.
Accuracy results:
- client maximum increases +1 on each iteration (same as expected for
  connection tracking):
    exact               = 8.33e+08
    sum/sumsq (128-bit) = 8.33e+08
    Welford (128-bit)   = 8.33e+08
    sum/sumsq (64-bit)  = 8.33e+08
    Welford (64-bit)    = 32.4295
- client maximum randomly increases in a range (1 - 10) on each iteration
  (possible for non-idempotent request tracking):
    exact               = 2.53805e+10
    sum/sumsq (128-bit) = 2.53805e+10
    Welford (128-bit)   = 2.53805e+10
    sum/sumsq (64-bit)  = -2.95145e+15
    Welford (64-bit)    = 32.43
- client maximum randomly increases in a range (1 - 100) on each
  iteration:
    exact               = 2.12403e+12
    sum/sumsq (128-bit) = 2.12403e+12
    Welford (128-bit)   = 2.12403e+12
    sum/sumsq (64-bit)  = -2.52534e+17
    Welford (64-bit)    = 32.4224
- client maximum randomly increases in a range (1 - 1000) on each
  iteration (possible for memory usage tracking, since we are planning
  to track memory usage in pages):
    exact               = 2.08852e+14
    sum/sumsq (128-bit) = 2.08852e+14
    Welford (128-bit)   = 2.08852e+14
    sum/sumsq (64-bit)  = -2.47926e+19
    Welford (64-bit)    = 32.419
Part-of: training/defence mode implementation
Issue: #1346
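For illustration, the two helpers named in this commit could look roughly like the userspace sketch below. It is not the code from the patch: the function names `u128_div_u32` and `u128_isqrt` are hypothetical, and the sketch relies on the GCC/Clang `unsigned __int128` extension for storage, shifts and multiplication only. The division itself is done bit by bit, so no compiler-provided 128-bit division routine is needed.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;	/* GCC/Clang extension, assumed available */

/*
 * Divide a 128-bit value by a 32-bit divisor with bitwise (shift-subtract)
 * long division. Returns the quotient and stores the remainder in *rem.
 */
static u128
u128_div_u32(u128 n, uint32_t d, uint32_t *rem)
{
	u128 q = 0;
	uint64_t r = 0;	/* r < d < 2^32, so (r << 1) | 1 always fits */
	int i;

	assert(d != 0);

	for (i = 127; i >= 0; i--) {
		/* Bring down the next dividend bit, most significant first. */
		r = (r << 1) | (uint64_t)((n >> i) & 1);
		if (r >= d) {
			r -= d;
			q |= (u128)1 << i;
		}
	}
	*rem = (uint32_t)r;
	return q;
}

/* floor(sqrt(n)) found by binary search over the 64-bit result range. */
static uint64_t
u128_isqrt(u128 n)
{
	uint64_t lo = 0, hi = UINT64_MAX;

	while (lo < hi) {
		/* Upper midpoint, so that "lo = mid" always makes progress. */
		uint64_t mid = lo + (hi - lo) / 2 + 1;

		if ((u128)mid * mid <= n)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}
```

The binary search needs at most 64 iterations and the division loop always runs 128, so both have a fixed upper bound on their cost.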
Add a generic training/defence subsystem used to detect abnormal
behavior based on z-score statistics.
The implementation provides:
- training mode: collect per-event statistics (sum, sumsq, count)
using percpu counters to minimize contention;
- defence mode: evaluate incoming values against the calculated mean/std
and reject anomalies exceeding the configured z-score threshold (drop
the connection with TCP RST).
Use the adaptive limits (training/defence) library for per-client
connection tracking. Maintain the current and maximum number of
concurrent connections per client and update the statistics on each new
maximum of concurrent client connections. In defence mode, calculate the
z-score for the client on each newly established connection and drop the
connection if the z-score exceeds the configured threshold.
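As a rough illustration of the defence-mode check, the z-score test can be done without floating point or division by comparing squared quantities: with n samples, z > T is equivalent to n·x − sum > 0 and (n·x − sum)² > T²·(n·sumsq − sum²). The sketch below is an assumption about how such a check might be written, not the patch's code: `struct stats_sketch` and `zscore_exceeds` are hypothetical names, population variance is assumed, and with zero variance any value above the mean is treated as an anomaly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef unsigned __int128 u128;	/* GCC/Clang extension, assumed available */

/* Hypothetical snapshot of the statistics collected during training. */
struct stats_sketch {
	uint64_t count;	/* number of samples (clients) */
	uint64_t sum;	/* sum of per-client maxima */
	u128 sumsq;	/* sum of squared per-client maxima */
};

/*
 * Return true if the z-score of @x is strictly above @threshold.
 * Works on squared values, so no square root or division is needed.
 * Intermediates are 128-bit; the caller is assumed to keep the inputs
 * small enough that dev * dev does not overflow.
 */
static bool
zscore_exceeds(const struct stats_sketch *s, uint64_t x, uint32_t threshold)
{
	u128 nx, dev, var_n2;

	assert(s->count > 0);

	nx = (u128)s->count * x;
	if (nx <= s->sum)
		return false;	/* at or below the mean: never an anomaly */

	dev = nx - s->sum;					/* n * (x - mean) */
	var_n2 = (u128)s->count * s->sumsq - (u128)s->sum * s->sum; /* n^2 * var */

	return dev * dev > (u128)threshold * threshold * var_n2;
}
```

For the samples {1, 2, 3, 4, 5} the mean is 3 and the standard deviation is about 1.41, so a value of 6 has a z-score of about 2.12: it exceeds a threshold of 2 but not a threshold of 3.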
The classical Welford algorithm was evaluated but found unsuitable for
this workload. In its original form Welford assumes an append-only stream
of samples, where each new observation increases the sample count.
In our case, "n" represents the number of clients rather than the number
of events. For each client we continuously update the current maximum
number of connections/requests/memory/cpu usage. When a value changes,
the previous sample must be removed from the aggregated statistics before
the updated value is inserted. This requires a replace/update operation
rather than append-only updates, which implies a reversible variant of
Welford’s algorithm and significantly increases implementation complexity.
We therefore use a sum/sumsq based approach.
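The replace operation described above is cheap with sum/sumsq: removing the old value and inserting the new one collapses into two deltas, new − old for the sum and new² − old² for the sum of squares, while the sample count stays unchanged. A minimal sketch follows, with hypothetical names (`struct agg_sketch`, `agg_insert`, `agg_replace`) and under the assumption that a client's maximum only grows within one training period.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;	/* GCC/Clang extension, assumed available */

/* Hypothetical aggregate over all clients. */
struct agg_sketch {
	uint64_t count;	/* number of clients */
	uint64_t sum;	/* sum of per-client maxima */
	u128 sumsq;	/* sum of squared per-client maxima */
};

/* Add a new client whose first recorded maximum is @val. */
static void
agg_insert(struct agg_sketch *a, uint64_t val)
{
	a->count++;
	a->sum += val;
	a->sumsq += (u128)val * val;
}

/*
 * Replace a client's previous maximum with a larger one. The sample count
 * does not change: only the two deltas are applied.
 */
static void
agg_replace(struct agg_sketch *a, uint64_t old_max, uint64_t new_max)
{
	assert(new_max >= old_max);

	a->sum += new_max - old_max;
	a->sumsq += (u128)new_max * new_max - (u128)old_max * old_max;
}
```

A reversible Welford update would have to recompute the running mean and the sum of squared deviations for both the removed and the inserted value, which is the extra complexity the commit message refers to.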
Although sum/sumsq is generally considered less numerically stable than
Welford’s algorithm due to potential catastrophic cancellation when
subtracting large nearly equal values, this is not a concern in our case.
For the expected value ranges in production workloads, such pathological
distributions (e.g. values clustered around 1e9 with variance ≈ 1) are
not realistic, and numerical precision remains sufficient.
Part-of: training/defence mode implementation
Issue: #1346
Use the adaptive limits framework to track per-client in-flight
non-idempotent requests, since only such requests occupy upstream
connections and are therefore suitable for overload detection.
Introduce `TfwAdaptiveLimitLock`, a generic adaptive limit structure with
a per-CPU counter, per-epoch maximum tracking, and synchronization for
training epoch transitions.
Extend the adaptive limits library with helpers for request accounting
and z-score calculation, reusing the existing logic.
Tracking of in-flight non-idempotent requests is performed in two stages:
- We account non-idempotent requests in the HTTP layer by incrementing
  the counter when a non-idempotent request is queued and decrementing it
  once the request completes. At this stage the current request count is
  updated using per-CPU counters without acquiring any locks.
- The second stage occurs in the `on_rcv_finish` callback at the end of
  `ss_tcp_process_data`. At this point, the current number of in-flight
  requests is obtained by aggregating all per-CPU counters. If the
  aggregated value exceeds the previously recorded maximum, the maximum
  is updated atomically and the corresponding deltas are applied to the
  global `sum` and `sumsq` statistics. This aggregated value is also used
  in defence mode for z-score calculation and for deciding whether the
  client should be blocked.
This approach avoids expensive synchronization on every request while
still maintaining accurate client maxima for statistical analysis.
Part-of: training/defence mode implementation
Issue: #1346
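The two stages can be sketched in userspace C11 as follows. This is an illustration under stated assumptions, not the patch's code: `struct limit_sketch`, `NR_CPUS_SKETCH` and the three functions are hypothetical, a plain array stands in for real per-CPU storage, and a compare-and-swap loop stands in for whatever atomic primitive the patch uses to update the maximum.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical fixed CPU count */

struct limit_sketch {
	int64_t percpu[NR_CPUS_SKETCH];	/* per-CPU in-flight request deltas */
	_Atomic int64_t max;		/* highest aggregated value seen */
};

/*
 * Stage 1: account a queued (+1) or completed (-1) request on the local
 * CPU's slot. A single slot may go negative when a request is queued on
 * one CPU and completed on another; only the total is meaningful.
 */
static void
limit_account(struct limit_sketch *l, unsigned int cpu, int64_t delta)
{
	assert(cpu < NR_CPUS_SKETCH);
	l->percpu[cpu] += delta;
}

/* Stage 2a: aggregate all per-CPU slots into the current value. */
static int64_t
limit_current(const struct limit_sketch *l)
{
	int64_t total = 0;
	unsigned int i;

	for (i = 0; i < NR_CPUS_SKETCH; i++)
		total += l->percpu[i];
	return total;
}

/*
 * Stage 2b: raise the recorded maximum to @curr if it is larger. On
 * success, report the deltas to apply to the global sum and sumsq.
 */
static bool
limit_update_max(struct limit_sketch *l, int64_t curr,
		 int64_t *delta1, int64_t *delta2)
{
	int64_t old = atomic_load(&l->max);

	while (curr > old) {
		/* On failure @old is reloaded and the condition rechecked. */
		if (atomic_compare_exchange_weak(&l->max, &old, curr)) {
			*delta1 = curr - old;
			*delta2 = curr * curr - old * old;
			return true;
		}
	}
	return false;
}
```

Only the thread that wins the compare-and-swap reports deltas, so concurrent callers that observe the same new maximum cannot apply it to the global statistics twice.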
Add per-socket training_epoch field to track the training generation for connection-related statistics. This allows associating socket events with a specific training period and prevents mixing measurements across training epochs when switching between TRAINING and DEFENCE modes.
Extend the adaptive limits framework to track per-client CPU usage during
request/response processing and use it as an additional overload
detection metric.
Introduce a CPU adaptive limit based on `TfwAdaptiveLimitLock` and
integrate it into the existing training and defence infrastructure.
Unlike request tracking, CPU usage is accumulated using an exponential
moving average (EMA), which provides a stable estimate of client CPU
consumption without introducing synchronization overhead. (A simple
counter would grow monotonically throughout the lifetime of a client,
making it unsuitable for anomaly detection. The EMA provides a bounded
and continuously adapting estimate of recent CPU activity.)
CPU usage is tracked in two places:
- Measure processing time by recording CPU cycles at the beginning of
  `ss_tcp_process_data()` and calculating the elapsed time in the
  `conn_recv_finish` callback after all received data has been processed.
  The measured delta is used to update the client's CPU usage statistics.
  (This is the primary accounting path.)
- CPU usage is also accounted during response processing in
  `tfw_http_msg_process_generic`. In this case, CPU cycles are measured
  at the function entry and exit.
During training, aggregate per-CPU EMA values, update the recorded
maximum CPU usage, and adjust the global statistical model. In defence
mode, calculate the client's CPU usage z-score and drop the connection
when it exceeds the configured threshold.
Reuse the existing adaptive limits infrastructure and IP blocking
mechanism for enforcement.
Part-of: training/defence mode implementation
Issue: #1346
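To illustrate why an EMA stays bounded where a plain counter would not, here is a minimal integer-only sketch. The smoothing factor of 1/2^shift and the name `ema_update` are assumptions made for the example; the commit message does not say which weight or fixed-point format the patch uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Integer exponential moving average with smoothing factor 1 / 2^shift:
 *   ema' = ema + (sample - ema) / 2^shift
 * The two branches keep all arithmetic unsigned. The result always lies
 * between the previous average and the new sample, so it cannot grow
 * without bound the way an ever-increasing cycle counter would.
 */
static uint64_t
ema_update(uint64_t ema, uint64_t sample, unsigned int shift)
{
	assert(shift < 64);

	if (sample >= ema)
		return ema + ((sample - ema) >> shift);
	return ema - ((ema - sample) >> shift);
}
```

With `shift = 3`, each new measurement of elapsed CPU cycles moves the average by one eighth of the difference, so a short burst decays over subsequent quiet intervals.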
Use the training library and the `TfwAdaptiveLimitLock` structure for
client memory usage tracking.
In defence mode, calculate the z-score in the `tfw_http_conn_recv_finish`
callback, compare it with the configured `threshold`, and drop the client
connection if necessary (the same as we do for non-idempotent requests).
The current approach with per-CPU request accounting prevents performance
degradation.
Note that we also account memory usage in the per-CPU `mem` storage to
check the `soft` and `hard` memory limits. A separate storage is needed
because we zero `TfwAdaptiveLimitLock` at the start of each new training
and do not account events from the previous training in
`TfwAdaptiveLimitLock`.
Performance measurements for the whole patchset show no measurable
regression:
Training:
  finished in 50.03s, 1205382.84 req/s, 933.22MB/s
  finished in 50.03s, 1206352.90 req/s, 935.01MB/s
  finished in 50.03s, 1212849.66 req/s, 940.37MB/s
Defence:
  finished in 50.03s, 1202041.02 req/s, 931.99MB/s
  finished in 50.03s, 1221799.64 req/s, 947.31MB/s
  finished in 50.02s, 1214020.14 req/s, 941.28MB/s
Master:
  finished in 50.03s, 1204474.98 req/s, 932.55MB/s
  finished in 50.03s, 1214912.74 req/s, 941.36MB/s
  finished in 50.03s, 1221197.26 req/s, 946.84MB/s
Part-of: training/defence mode implementation
Issue: #1346
No description provided.